Zipf's law
Zipf's law is an empirical law, formulated using mathematical statistics, that refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution. For example, Zipf's law states that given some corpus of natural-language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on: the rank-frequency distribution is an inverse relation.

For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.

The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of the number of people watching the same TV channel, and so on.

Zipf's law is one of a family of related discrete power law probability distributions. The Zipf distribution is related to the zeta distribution, but is not identical.

Figure: Result of a random-merging simulation. Starting with an array of exactly 1,000,000 ones, and repeatedly picking two entries at random and combining them until only 100 are left, results on average in the plotted distribution. The results do not match Zipf's law (yellow line of dots). Some additional process must be invoked to explain Zipf's law.

Preferential attachment (intuitively, "the rich get richer" or "success breeds success"), which results in the Yule–Simon distribution, has been shown to fit word frequency versus rank in language and population versus city rank better than Zipf's law.

Hapax legomenon

Figure: Rank-frequency plot of the words in a novel. About 44% of the distinct words in this novel, such as "matrimonial", occur only once, and so are hapax legomena (red). About 17%, such as "dexterity", appear twice (so-called dis legomena, in blue). Zipf's law predicts that the words in this plot should approximate a straight line.

In corpus linguistics, a hapax legomenon is a word that occurs only once within a context, either in the written record of an entire language, in the works of an author, or in a single text. Hapax legomenon is a transliteration of Greek ἅπαξ λεγόμενον, meaning "(something) being said (only) once". The related terms dis legomenon, tris legomenon, and tetrakis legomenon refer respectively to double, triple, and quadruple occurrences, but are far less commonly used.

Hapax legomena are quite common, as predicted by Zipf's law, which states that the frequency of any word in a corpus is inversely proportional to its rank in the frequency table. For large corpora, about 40% to 60% of the words are hapax legomena, and another 10% to 15% are dis legomena. Thus, in the Brown Corpus of American English, about half of the 50,000 distinct words are hapax legomena within that corpus.

Hapax legomena in ancient texts are usually difficult to decipher, since it is easier to infer meaning from multiple contexts than from just one. For example, many of the remaining undeciphered Mayan glyphs are hapax legomena, and Biblical (particularly Hebrew) hapax legomena sometimes pose problems in translation. Hapax legomena also pose challenges in natural language processing.

There are about 1,500 hapax legomena in the Hebrew Bible; however, due to Hebrew roots, suffixes, and prefixes, only 400 are "true" hapax legomena.
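Both the inverse rank-frequency relation and the hapax legomena proportions described in this text can be checked numerically. The following is a minimal Python sketch, not a reference implementation. The function names (zipf_expected, rank_frequency, legomena_shares) are hypothetical, the tokenizer is a deliberately crude assumption (lowercase runs of ASCII letters), and the sample sentence is invented for illustration. The three counts are the Brown Corpus figures quoted in the text, compared against an ideal Zipfian distribution with exponent 1.

```python
import re
from collections import Counter


def zipf_expected(top_count, rank):
    """Count an ideal Zipfian word of this rank would have (exponent 1)."""
    return top_count / rank


def rank_frequency(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Assumption: a word is a run of ASCII letters; real corpora need
    # a proper tokenizer (apostrophes, hyphens, non-English scripts).
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common()


def legomena_shares(text):
    """Fractions of distinct words occurring exactly once and exactly twice."""
    table = rank_frequency(text)
    distinct = len(table)
    if distinct == 0:
        return 0.0, 0.0
    once = sum(1 for _, count in table if count == 1)   # hapax legomena
    twice = sum(1 for _, count in table if count == 2)  # dis legomena
    return once / distinct, twice / distinct


# Counts for "the", "of", "and" as quoted in the text.
observed = {1: 69971, 2: 36411, 3: 28852}
for rank, count in observed.items():
    ideal = zipf_expected(observed[1], rank)
    print(f"rank {rank}: observed {count}, ideal Zipf {ideal:.1f}")

# Invented sample sentence, only to exercise the counting functions.
sample = "the cat and the dog and the bird saw a cat"
hapax_share, dis_share = legomena_shares(sample)
print(f"hapax legomena: {hapax_share:.2%} of distinct words")
print(f"dis legomena:   {dis_share:.2%} of distinct words")
```

The observed second and third counts sit somewhat above the ideal values, which is consistent with the text's description of Zipf's law as an approximation. Running legomena_shares on a full-length text rather than a toy sentence would be needed to compare against the 40% to 60% range quoted for large corpora.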